Mining Domain Specific Words from Web Documents

Author

  • Jing-Shin Chang
Abstract

Web pages provide not only plain-text material for training language models but also tag information for semantic annotation. The tags can be found either explicitly in the HTML documents or implicitly through the documents' directory hierarchy: the hierarchy can be regarded as a classification tree for web documents, which assigns an implicit tag to each document and hence to the words it contains. For instance, the domain-specific words in documents under the “sport” hierarchy are likely to be tagged with a “sport” tag. These tags, in turn, can be used in various word sense disambiguation (WSD) tasks and in other applications such as anti-spam mail filters. Such rich annotation provides a useful knowledge source for mining various semantic links among words. This presentation proposes a statistical method for finding domain-specific words in particular domains, and thus their associations, by taking advantage of the hierarchical structure of web pages. With the statistical model, the document tree can in effect be converted into a large, semantically annotated lexicon tree. Preliminary results suggest that the approach is effective at finding domain-specific words.
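
The abstract does not spell out the statistical model, so the following is only an illustrative sketch of the general idea, not the author's method. It assumes that the top-level directory of a document path serves as the implicit domain tag, and it scores words with a smoothed log relative-frequency ratio (a swapped-in, commonly used measure). The function names, the crude tokenizer, and the `smoothing` parameter are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def domain_of(path):
    """Assumption: the top-level directory of the path is the implicit domain tag."""
    return path.strip("/").split("/")[0]


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def domain_word_scores(docs, smoothing=1.0):
    """Score each (domain, word) pair by a smoothed log relative-frequency ratio.

    docs: iterable of (path, text) pairs.
    Returns {domain: {word: score}}. A higher score means the word is more
    concentrated in that domain than in the rest of the collection.
    """
    # Word counts per domain, where the domain comes from the directory path.
    counts = defaultdict(Counter)
    for path, text in docs:
        counts[domain_of(path)].update(tokenize(text))

    # Collection-wide counts, used to derive the "outside the domain" counts.
    total = Counter()
    for domain_counts in counts.values():
        total.update(domain_counts)
    vocab_size = len(total)
    total_tokens = sum(total.values())

    scores = {}
    for domain, domain_counts in counts.items():
        in_tokens = sum(domain_counts.values())
        out_tokens = total_tokens - in_tokens
        scores[domain] = {}
        for word, n_in in domain_counts.items():
            n_out = total[word] - n_in
            # Additive smoothing keeps the ratio finite for unseen words.
            p_in = (n_in + smoothing) / (in_tokens + smoothing * vocab_size)
            p_out = (n_out + smoothing) / (out_tokens + smoothing * vocab_size)
            scores[domain][word] = math.log(p_in / p_out)
    return scores


def top_words(scores, domain, k=3):
    """Return the k highest-scoring words for a domain (ties broken alphabetically)."""
    ranked = sorted(scores[domain].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]
```

Calling `domain_word_scores` on a list of `(path, text)` pairs and then `top_words(scores, "sport")` would rank words that are frequent under the `sport/` directory but rare elsewhere above common function words. Applying the same scoring at deeper directory levels is one way the document tree could be turned into a tagged lexicon tree.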

Similar resources

Dictionary-based Sentiment Analysis Applied to Specific Domain using a Web Mining Approach

In recent years, the Web and social media have been growing exponentially. They provide documents in which opinions are expressed on many topics. This constitutes a rich source for Natural Language Processing tasks, in particular Sentiment Analysis. In this work, we aim at constructing a sentiment dictionary based on words obtained from web pages related to a specific domain. To do so, we...

Expert Discovery: A web mining approach

Expert discovery is the search for an answer to the question: “Who is the best expert on a specific subject in a particular domain, within a given set of parameters?” An expert with domain knowledge in a field is crucial for consulting in industry, academia, and the scientific community. The aim of this study is to address the issues of the expert-finding task in a real-world community. Collabor...

Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

As the Web grows, establishing a general framework for data mining and for retrieving structured data from the Web poses many challenges. Creating an ontology is a step towards solving this problem. The ontology captures the main entities and concepts of the data being mined. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...

Classification of Web Documents Using Concept Extraction from Ontologies

In this paper, we deal with the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present an ontology-based web content mining methodology whose main stages include creating an ontology for the specified domain, collecting a training set of labeled documents, building a classification model in this domain using the constructed onto...

Improving Classification of Multi-Lingual Web Documents using Domain Ontologies

In this paper, we deal with the problem of analyzing and classifying web documents into several major categories/classes in a given domain using a domain ontology. We present an ontology-based web content mining methodology whose main stages include collecting a training set of labeled documents from a given domain, building a classification model over this domain given the domain ontolog...

Publication date: 2004